Contributor

Copilot AI commented Oct 11, 2025

  • Investigate and understand the issue with empty capture groups when NonBacktracking is used with newline at end
  • Identify root cause in DefaultInputReader.GetPositionId and FindSubcaptures method
  • Create focused test cases that reproduce the issue for all anchor combinations
  • Fix the bug by conditioning special \n handling on _containsEndZAnchor for both match finding and capture tracking
  • Build and test the changes
  • Run full test suite to verify no regressions (29,315 functional tests + 1,005 unit tests passed)
  • Address follow-up concerns about \z anchor behavior

Summary

Fixed a regression in NonBacktracking regex where capture groups would be empty when the input ends with a newline character.

Root Cause: The DefaultInputReader.GetPositionId method was unconditionally applying special handling for '\n' at the end of input (returning _minterms.Length instead of the normal minterm ID). This special handling is specifically designed for the \Z anchor to support lookahead semantics, but was being applied to ALL anchored patterns, causing incorrect derivative transitions and empty capture groups.

Solution:

  1. In DefaultInputReader.GetPositionId: condition the special '\n' handling on _containsEndZAnchor (only for \Z anchor)
  2. In FindSubcaptures: use DefaultInputReader when _containsEndZAnchor is true (needs special handling), otherwise use NoZAnchorOptimizedInputReader (normal handling)

This ensures:

  • Patterns with \Z or $ (non-multiline) use special '\n' handling for correct lookahead semantics
  • Patterns with \z, ^, or no end anchor use normal '\n' handling for correct capture tracking
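
As a rough illustration of why only \Z (and non-multiline $) needs the special case, the sketch below uses the default backtracking engine to show how the end anchors treat a trailing '\n'. The class and method names (`EndAnchorDemo`, `Matches`) are made up for this example and are not part of the PR.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EndAnchorDemo
{
    // Hypothetical helper: true if `pattern` matches anywhere in `input`
    // using the default engine with no options.
    public static bool Matches(string input, string pattern) =>
        Regex.IsMatch(input, pattern);

    public static void Main()
    {
        const string input = "A\n";

        // \Z matches at the very end of the input or just before a final '\n'.
        Console.WriteLine(Matches(input, @"A\Z"));   // True
        // Without RegexOptions.Multiline, $ behaves the same way as \Z.
        Console.WriteLine(Matches(input, @"A$"));    // True
        // \z matches only at the very end, so the trailing '\n' gets in the way.
        Console.WriteLine(Matches(input, @"A\z"));   // False
        // Consuming the '\n' explicitly lets \z succeed.
        Console.WriteLine(Matches(input, @"A\n\z")); // True
    }
}
```

Because \Z has to "see" a final '\n' without consuming it, it is the only one of these anchors that needs look-ahead-like treatment at the end of input.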

Testing:

  • Added regression test cases for all anchor combinations: ^(A)(\s), ^(A)(\s)\z, ^(A)(\s)$, ^(A)(\s)\Z, and (A)(\s)\z
  • All 29,315 functional tests pass
  • All 1,005 unit tests pass
  • Verified existing \Z anchor behavior remains correct
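
The regression coverage listed above could be exercised with something like the following sketch, which runs each pattern against "A\n" with RegexOptions.NonBacktracking and reports both groups. The names `CaptureCheck` and `Describe` are assumptions for illustration and do not come from the PR's test file. On a runtime that includes the fix, each line should report 'A' and '\n'; affected builds are expected to show empty groups for some patterns.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureCheck
{
    // The anchor combinations named in the testing notes.
    public static readonly string[] Patterns =
    {
        @"^(A)(\s)", @"^(A)(\s)\z", @"^(A)(\s)$", @"^(A)(\s)\Z", @"(A)(\s)\z",
    };

    // Hypothetical helper: formats the match result for one pattern,
    // rendering a captured newline visibly as \n.
    public static string Describe(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern, RegexOptions.NonBacktracking);
        string g1 = m.Groups[1].Value;
        string g2 = m.Groups[2].Value.Replace("\n", "\\n");
        return $"{pattern}: success={m.Success} g1='{g1}' g2='{g2}'";
    }

    public static void Main()
    {
        foreach (string pattern in Patterns)
        {
            Console.WriteLine(Describe(pattern, "A\n"));
        }
    }
}
```
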
Original prompt

This section details the original issue you should resolve

<issue_title>Newline causes empty regex capture groups when NonBacktracking in dotnet 9 and above</issue_title>
<issue_description>### Description

Found while upgrading 8 -> 10, but appears in 9 as well. I'm fairly sure I can't find an explanation of it in the documentation.

Both Nonbacktracking and having the final character be a \n rather than any other kind of whitespace appear to be critical to triggering the bug. Nonbacktracking is in there because this case was cut down from a much larger regular expression while trying to diagnose this. It is important to the larger expression for performance.

### Reproduction Steps

`using System;
using System.Text.RegularExpressions;

string line = "A\n";
// Verbatim string so that \s reaches the regex engine instead of being parsed as a C# escape.
string expression = @"^(A)(\s)";
Match match = Regex.Match(line, expression, RegexOptions.IgnoreCase | RegexOptions.NonBacktracking);
if (!match.Success)
{
    System.Environment.Exit(1);
}

Console.WriteLine($"reg: '{match.Groups[1]}' '{match.Groups[2]}'");`

### Expected behavior

reg: 'A' ' '

The two characters of input are each matched and end up captured in a group.

### Actual behavior

reg: '' ''

Both capture groups end up empty, even though the regex has matched.

### Regression?

Works as expected in dotnet 8.

### Known Workarounds

Applying Trim() to the string rather than relying on regex matching of the start and end, or applying RegexOptions.Multiline

### Configuration

Dotnet 10.0.0-rc.1.25451.107 on Windows x64 in VS Insiders.

As above - this code has been in place for many years and works on dotnet 8 and many previous versions.

### Other information

No response</issue_description>

<agent_instructions>This will need investigation to determine the root cause. I suspect it was introduced in #102655</agent_instructions>

Comments on the Issue (you are @copilot in this section)

Fixes #120202



@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

The issue was in DefaultInputReader.GetPositionId which was applying special
handling for '\n' at the end of input unconditionally. This special handling
is only needed for the \Z anchor. Added a check for _containsEndZAnchor to
only apply the special handling when needed.

Also added regression tests to verify the fix.

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix empty regex capture groups with NonBacktracking in dotnet" to "Fix NonBacktracking regex empty capture groups with newline at end of input" Oct 11, 2025
Copilot AI requested a review from stephentoub October 11, 2025 16:39
Copilot finished work on behalf of stephentoub October 11, 2025 16:39
stephentoub requested a review from veanes October 11, 2025 20:05
stephentoub marked this pull request as ready for review October 11, 2025 20:05
Copilot AI review requested due to automatic review settings October 11, 2025 20:05
Contributor

Copilot AI left a comment

Pull Request Overview

This PR fixes a regression in .NET 9 where NonBacktracking regex capture groups were incorrectly returned as empty strings when the input ended with a newline character, even though the overall match succeeded.

  • Modified the condition in DefaultInputReader.GetPositionId to only apply special \n handling when the pattern actually contains the \Z anchor
  • Added regression test cases to verify the fix works correctly
  • Ensures existing \Z anchor behavior remains unchanged

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/SymbolicRegexMatcher.cs | Fixed the condition to only apply special newline handling when \Z anchor is present |
| src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Groups.Tests.cs | Added regression test cases for the fixed capture group behavior |

@stephentoub
Member

cc: @ieviev

@ieviev
Contributor

ieviev commented Oct 16, 2025

i found another inconsistency:

  • ^(A)(\s) does not set capture groups
  • ^(A)(\s)\z does not set capture groups
  • ^(A)(\s)$ does set capture groups
  • ^(A)(\s)\Z does set capture groups

this might also be related to the handling of \Z

@stephentoub
Member

@copilot, can you investigate that as well?

Contributor Author

Copilot AI commented Oct 16, 2025

@copilot, can you investigate that as well?

I've already investigated and fixed the issue described in #120202 (empty capture groups with NonBacktracking when input ends with '\n'). The fix conditions the special '\n' handling on _containsEndZAnchor and all tests pass.

Could you clarify what specific aspect you'd like me to investigate further?

Corrected the issue number from #110604 to #120202 in the regression test comment.

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Copilot finished work on behalf of stephentoub October 16, 2025 12:16
Copilot AI requested a review from stephentoub October 16, 2025 12:16
@stephentoub
Member

@copilot, please investigate the follow-up concern raised by @ieviev:

i found another inconsistency:
^(A)(\s) does not set capture groups
^(A)(\s)\z does not set capture groups
^(A)(\s)$ does set capture groups
^(A)(\s)\Z does set capture groups
this might also be related to the handling of \Z

The issue was that FindSubcaptures was using DefaultInputReader which applies
special \n handling for \Z anchor. This interfered with capture tracking for
patterns with beginning anchors but without \Z.

The fix is to make FindSubcaptures always use NoZAnchorOptimizedInputReader,
which doesn't apply the special \n handling. This allows:
- Match finding to use the special \n handling for \Z anchor (correct)
- Capture tracking to use the actual character without special handling (correct)

Added regression tests for all anchor combinations.

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Contributor Author

Copilot AI commented Oct 16, 2025

@copilot, please investigate the follow-up concern raised by @ieviev:

i found another inconsistency:
...

I've investigated and fixed the inconsistency. The root cause was that FindSubcaptures was using DefaultInputReader which applies special '\n' handling for \Z anchor. This interfered with capture tracking for patterns without \Z.

The fix (commit 4fe286e) makes FindSubcaptures always use NoZAnchorOptimizedInputReader to avoid the special '\n' handling during capture tracking, while preserving it for match finding.

All anchor combinations now work correctly:

  • ^(A)(\s)
  • ^(A)(\s)\z
  • ^(A)(\s)$
  • ^(A)(\s)\Z

Added regression tests for all these patterns.

Copilot finished work on behalf of stephentoub October 16, 2025 12:53
@stephentoub
Member

@ieviev, does that sound right?

@ieviev
Contributor

ieviev commented Oct 16, 2025

@copilot
(A)(\s)\z also returns empty captures, test this case as well

@ieviev
Contributor

ieviev commented Oct 16, 2025

if (A)(\s)\z works then i believe it should be correct,
the special case is only for the large \Z, since it's supporting a lookahead that would otherwise not be supported at all

@stephentoub
Member

@copilot, please see @ieviev's comments

Contributor Author

Copilot AI commented Oct 16, 2025

@copilot, please see @ieviev's comments

I don't see @ieviev's comments in the thread. Could you please share what feedback they provided so I can address it?

Copilot finished work on behalf of stephentoub October 16, 2025 16:15
@stephentoub
Member

@copilot, they wrote:

(A)(\s)\z also returns empty captures, test this case as well
if (A)(\s)\z works then i believe it should be correct,
the special case is only for the large \Z, since it's supporting a lookahead that would otherwise not be supported at all

Based on feedback from @ieviev, the special \n handling should only apply for
\Z anchor (uppercase), not for \z or other anchors. This ensures:
- Patterns with \Z use special handling (for lookahead semantics)
- Patterns with \z, ^, or no end anchor use normal handling (correct captures)

Added test case for (A)(\s)\z without beginning anchor as requested.
All 29,315 tests pass.

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Contributor Author

Copilot AI commented Oct 16, 2025

@copilot, they wrote:

(A)(\s)\z also returns empty captures, test this case as well
if (A)(\s)\z works then i believe it should be correct,
...

Added test case for (A)(\s)\z without beginning anchor. The fix (commit 0a5e63f) now conditions the special '\n' handling on _containsEndZAnchor, so it only applies for \Z anchor as you suggested. All patterns now work correctly:

  • (A)(\s)\z
  • ^(A)(\s)\z
  • ^(A)(\s)
  • ^(A)(\s)$
  • ^(A)(\s)\Z

All 29,315 tests pass.

stephentoub removed their assignment Oct 17, 2025
stephentoub enabled auto-merge (squash) October 17, 2025 16:16
jeffhandley requested a review from tarekgh October 17, 2025 17:24
Member

tarekgh left a comment

LGTM!

stephentoub merged commit aa9c20b into main Oct 17, 2025
84 of 86 checks passed
stephentoub deleted the copilot/fix-regex-capture-groups branch October 17, 2025 19:42

Development

Successfully merging this pull request may close these issues.

Newline causes empty regex capture groups when NonBacktracking in dotnet 9 and above

4 participants